In GPU acceleration, we must abandon the "compute-first" mindset. Modern performance is dictated by memory management: the orchestration of data allocation, placement, synchronization, and movement between the host (CPU) and device (GPU).
1. The Memory-Compute Disparity
While GPU arithmetic throughput (TFLOP/s) has skyrocketed, memory bandwidth (GB/s) has grown far more slowly. This creates a gap where the execution units are often "starved," idling while they wait for data to arrive from VRAM. Consequently, GPU programming is often memory programming.
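To make the disparity concrete, here is a minimal back-of-envelope sketch. The specs are assumptions for a hypothetical GPU (50 TFLOP/s peak FP32 compute, 1 TB/s memory bandwidth), not measurements of any specific device; the kernel modeled is a simple vector add.

```python
# Hypothetical GPU specs (assumptions, not a real device):
PEAK_FLOPS = 50e12   # 50 TFLOP/s peak FP32 throughput
BANDWIDTH = 1e12     # 1 TB/s VRAM bandwidth

# Vector add c = a + b over N fp32 elements:
# 1 FLOP per element, 12 bytes moved per element
# (read a: 4 B, read b: 4 B, write c: 4 B).
N = 1 << 28
flops = N
bytes_moved = 12 * N

t_compute = flops / PEAK_FLOPS   # time if only math mattered
t_memory = bytes_moved / BANDWIDTH  # time to stream the data

print(f"compute time: {t_compute * 1e3:.3f} ms")
print(f"memory time:  {t_memory * 1e3:.3f} ms")
print(f"memory/compute ratio: {t_memory / t_compute:.0f}x")  # 600x
```

For this kernel the execution units spend roughly 600 times longer waiting on memory than doing arithmetic, which is exactly the "starvation" described above.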
2. The Roofline Model
This model visualizes the relationship between Arithmetic Intensity (FLOPs/Byte) and performance. Applications typically fall into two categories:
- Memory-Bound: Limited by bandwidth (the sloped portion of the roofline).
- Compute-Bound: Limited by peak TFLOP/s (the horizontal ceiling).
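The roofline itself is just one formula: attainable performance is the minimum of peak compute and arithmetic intensity times bandwidth. The sketch below uses the same hypothetical 50 TFLOP/s / 1 TB/s machine as an assumption; the "ridge point" is the intensity at which a kernel transitions from the sloped (memory-bound) region to the flat (compute-bound) ceiling.

```python
# Hypothetical machine (assumptions): 50 TFLOP/s peak, 1 TB/s bandwidth.
PEAK_FLOPS = 50e12
BANDWIDTH = 1e12

def attainable_tflops(ai):
    """Roofline: perf = min(peak compute, AI * bandwidth), in TFLOP/s.

    ai is arithmetic intensity in FLOPs per byte moved.
    """
    return min(PEAK_FLOPS, ai * BANDWIDTH) / 1e12

# Ridge point: intensity where the slope meets the ceiling.
ridge = PEAK_FLOPS / BANDWIDTH  # 50 FLOPs/byte here

print(attainable_tflops(1 / 12))  # vector add: deep in memory-bound territory
print(attainable_tflops(100.0))   # high-reuse kernel: hits the 50 TFLOP/s ceiling
```

Note how far the vector add (about 0.08 FLOPs/byte) sits below the 50 FLOPs/byte ridge point: no amount of compute optimization helps it, only reducing bytes moved does.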
3. The Tax of Data Movement
The primary performance bottleneck is rarely the math; it is the latency and energy cost of moving each byte across the PCIe bus or out of HBM. High-performance code keeps data resident on the device and minimizes host-device transfers.
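A rough model makes the transfer tax visible. The bandwidth figures are assumptions (roughly PCIe 4.0 x16 for the host link, 1 TB/s for device HBM); the point is the ratio, and how keeping data resident amortizes the one-time upload across many kernel launches.

```python
# Assumed bandwidths, not measurements:
PCIE_BW = 32e9   # bytes/s, roughly a PCIe 4.0 x16 link
HBM_BW = 1e12    # bytes/s, device memory
DATA = 1e9       # 1 GB working set

t_pcie = DATA / PCIE_BW  # one host -> device transfer
t_hbm = DATA / HBM_BW    # one full on-device pass over the data

print(f"PCIe transfer is {t_pcie / t_hbm:.2f}x slower than an HBM pass")

def time_per_kernel(k):
    """Average cost per kernel if one upload is reused by k kernels."""
    return (t_pcie + k * t_hbm) / k

# As k grows, the per-kernel cost falls toward the pure HBM time:
print(time_per_kernel(1))    # transfer dominates
print(time_per_kernel(100))  # transfer nearly amortized away
```

This is why "upload once, compute many times" is the standard pattern: re-transferring the working set for every kernel pays the 31x PCIe penalty each launch, while resident data pays it once.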